This Memorandum is a first draft of an essay on the simplest "learning"
processes. Comments are invited. Subsequent sections will treat, among
other things:


The "stimulus-sampling" model of Estes 
Relations between Perceptron-type error
reinforcement and Bayesian-type correlation
reinforcement

and some other statistical methods viewed in the same way.


1.0 Decisions based on a fixed set of tests


The events $X$ to be classified lie in a set of classes

$$F_1, \ldots, F_n.$$


We are given also a set $\{\varphi_i(X)\}$ of "tests" or "experiments" that can be
performed on the event that has just occurred. Define

$$\Phi(X) = (\varphi_1(X), \varphi_2(X), \ldots, \varphi_m(X))$$

to be the sequence of results of these tests, taken in some fixed order.

Usually we will restrict each $\varphi$ to be Boolean--that is, to have values 0 or 1.

Thus the machine M cannot "see" the real events $X$ directly, but only
through the results of the experiments $\varphi$. There is really no alternative
to this sort of indirectness, because only mystics believe in the
possibility of the immediate and direct experience of reality--and
whether or not this is in fact desirable, they are for some reason
unable to transmit to others their grounds for believing this.

Given the outcome $\Phi(X)$ of a set of experiments, we would like to guess
which class $F_j$ contains the event $X$ that has just occurred. In some situations
we can be sure which $F$ was responsible, for example, in the case that there
is only one $X$ that could have produced this particular value of $\Phi$. (It is
convenient to think of $\Phi = (\varphi_1, \ldots, \varphi_m)$ as a vector, so that we can talk
about its value instead of the sequence of values of all the $\varphi$'s.)

In general, we cannot usually be sure which $F_j$ was responsible. We
shall consider the rather general case in which our knowledge about the
situation can be summarized in the form of a table, or distribution, of
probabilities

$$P(F_j/\Phi)$$

the probability that $X$ is in $F_j$,
given that the experiments have produced $\Phi = (\varphi_1, \ldots, \varphi_m)$.



Most of this chapter is devoted to discussing ways in which this sort of
information could be acquired through experience. First, however, we will say
a few words about how we could use this information if we had it.

1.1 Costs and decision criteria

If a particular $\Phi_0$ occurs, and if we know that

$$P(F_j/\Phi_0) = 1, \qquad P(F_k/\Phi_0) = 0 \quad (k \ne j)$$

there is no question that $X \in F_j$ and we can say so definitely. But if
$0 < P < 1$ we have to guess. Usually, one will choose that $j$ for which

$$P(F_j/\Phi_0) \text{ is largest.}$$

This guess will have the smallest chance of being
wrong--the so-called "maximum likelihood criterion."

But in real situations, just trying to be right is not the only goal that
may have to be considered. Different kinds of mistakes can have grossly
different consequences. It can be more important to avoid the tiger than to
acquire the lady. Let

$$C_{jk} = \text{the cost of guessing } F_k \text{ when } X \text{ is really in } F_j.$$

Then the "expected" cost, if we guess $F_k$ (given $\Phi$), is

$$C(k, \Phi) = \sum_j C_{jk}\, P(F_j/\Phi).$$

That is, this is the average price one will pay, over a large collection of
experiments, if one has the policy of guessing $F_k$ whenever one sees that
particular $\Phi$. Clearly, our policy then should be to choose that $k$ for which
$C(k, \Phi)$ is the smallest.

Only in the most peculiar circumstances would it be that a correct guess
$C_{jj}$ would cost more than any incorrect guess $C_{ij}$ $(i \ne j)$. But one might have
a situation in which

$$C_{i1} \le C_{ij} \qquad \text{(all } i\text{), (all } j\text{)}$$

so that one might as well always guess $F_1$ (regardless of the value of $\Phi$);
there is no premium on being right otherwise. In the case

$$C_{ij} = 1 \quad (i \ne j), \qquad C_{jj} = 0$$

we have

$$C(k, \Phi) = \sum_{j \ne k} P(F_j/\Phi) = 1 - P(F_k/\Phi)$$

(since $\sum_j P(F_j/\Phi) = 1$), and minimizing this is, as it should be, the same as
maximizing $P(F_k/\Phi)$, i.e., the maximum likelihood strategy.
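
The minimum-expected-cost rule of this section can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the two-class posterior, and the cost tables are all invented for the example and do not come from the memorandum. `cost[j][k]` plays the role of the cost of guessing the k-th class when X is really in the j-th.

```python
def expected_costs(cost, posterior):
    """Return C(k, Phi) = sum_j cost[j][k] * P(F_j / Phi) for every guess k."""
    n = len(posterior)
    return [sum(cost[j][k] * posterior[j] for j in range(n)) for k in range(n)]


def best_guess(cost, posterior):
    """Index k of the guess with the smallest expected cost."""
    costs = expected_costs(cost, posterior)
    return min(range(len(costs)), key=costs.__getitem__)


# Hypothetical numbers: the first class is the more probable one.
posterior = [0.7, 0.3]

# All mistakes cost 1, correct guesses cost 0: reduces to "pick the most probable".
zero_one = [[0, 1],
            [1, 0]]

# Overlooking the second class (the "tiger") is made fifty times as costly.
tiger = [[0, 1],
         [50, 0]]

print(best_guess(zero_one, posterior))  # index of the most probable class
print(best_guess(tiger, posterior))     # the costly mistake now dominates
```

With the 0-1 table the rule picks the most probable class; with the lopsided table it picks the less probable one, because the expected price of missing it is larger.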

1.2 Inverting probabilities: Bayes' theorem

We have described decisions based upon knowledge of the probabilities
$P(F_j/\Phi)$. Usually, one starts with access to a different set, $P(\Phi/F_j)$, of
probabilities; $P(\Phi/F_j)$ is the probability that $\Phi$ will occur if $X \in F_j$. For
example, understanding of the mechanisms by which $X$'s in each
$F_j$ affect the devices that compute the $\varphi_i$ may permit a theoretical
calculation of what the $P(\Phi/F_j)$'s ought to be. Fortunately, one can calculate
the $P(F_j/\Phi)$'s from these as follows: By the definition of the conditional
probability symbol

$$P(B/A) = \frac{P(A \wedge B)}{P(A)}$$

where $P(B \wedge A)$ is the "joint" probability that both events $B$ and $A$ occur. But
then, also

$$P(A/B) = \frac{P(A \wedge B)}{P(B)}$$

because $P(A \wedge B) = P(B \wedge A)$.
Therefore

$$P(F_j/\Phi) = \frac{P(\Phi/F_j)\,P(F_j)}{P(\Phi)}.$$

Since $P(\Phi) = \sum_k [P(F_k)\,P(\Phi/F_k)]$ is the probability that $\Phi$ will occur, regardless
of which $F_k$ occurs, we can write

$$P(F_j/\Phi) = \frac{P(\Phi/F_j) \cdot P(F_j)}{\sum_k [P(\Phi/F_k) \cdot P(F_k)]} \qquad (A)$$

showing that we can calculate the $P(F_j/\Phi)$'s if we know the $P(\Phi/F_j)$'s and
also the "a priori" probabilities $P(F_j)$ of occurrences of $X$'s of the different
types. These are, of course, determined by the circumstances, and cannot be
calculated from a theory of how the $\varphi_i$ devices work.

The simple formula (A) is useful when one wants, so to speak, to invert
the roles of "cause" and "effect" when one has either data or theory that goes
in the wrong direction. We note that in many applications one needs only to
know the relative magnitudes of the different $P(F_j/\Phi)$'s, and since the
denominator of (A) does not depend on $j$, one may use instead

$$P(F_j/\Phi) \propto P(\Phi/F_j) \cdot P(F_j). \qquad (A')$$
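
As a small numerical illustration of formula (A), the sketch below inverts a table of P(Phi/F_j) values into P(F_j/Phi) values. The function name and the numbers are assumptions made up for the example.

```python
def posterior(likelihood, prior):
    """Formula (A): P(F_j / Phi) from P(Phi / F_j) and the a priori P(F_j).

    likelihood[j] is P(Phi / F_j) for the one Phi that was observed;
    prior[j] is P(F_j).
    """
    joint = [l * p for l, p in zip(likelihood, prior)]
    total = sum(joint)            # P(Phi), the denominator of (A)
    return [x / total for x in joint]


# Hypothetical two-class case: Phi is typical of the first class,
# but the first class is rare a priori.
likelihood = [0.8, 0.1]
prior = [0.2, 0.8]
print(posterior(likelihood, prior))
```

Because the denominator is the same for every j, comparing the unnormalized products `likelihood[j] * prior[j]` would select the same class.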


1.3 Problems of implementing the probability approach

There are a variety of practical and theoretical objections to the use
of formula (A). It is difficult, even in the simplest situations, to know
enough about the $P(\Phi/F_j)$'s to use this method. One might be able to
deduce them from a theoretical model of the situation, but then one would most
likely be able to construct a more sophisticated decision method involving
less computation. If there are many possible $\Phi$'s, then it becomes impractical
to estimate the $P(\Phi/F)$'s on the basis of empirical observation. The amount of
data would be too enormous. And the system is completely unable to "guess"
on $\Phi$'s it has not seen before. Finally, even if one had a complete catalog of
the $P(\Phi/F)$'s, knowledge in this form is so free of structure that it would be
very hard to adapt it to a similar, new situation. To be useful, knowledge
has to be cast into structured models.

The simplest and most common kinds of models are the "parametric
distributions." In this family of techniques--which includes many standard
methods of statistics--one has a procedure in which one

(a) assumes some form for the distributions of the $P(\Phi/F)$'s,

(b) fits the data to estimate the parameters of these distributions,

(c) designs a decision procedure based on the theory of the assumed
distribution forms.

Example: One thinks of the $\Phi = (\varphi_1, \ldots, \varphi_m)$ as points in a vector
space with the usual Cartesian distance metric.

Assume that each set $\{\Phi(X) \mid X \in F_j\}$ forms a "cluster"
with (say) a symmetric normal distribution.

Each cluster has (say) the same variance (concentration) but
different means (centers).

Then one can use the data to estimate the variance and means.

In this model, the decision of which $F_j$ a given $\Phi$ should be
assigned to can be made by finding which F-mean is closest to $\Phi$.
This, and many other related strategies, are discussed at length
in Nilsson's "Learning Machines," a book on the theory of a variety
of statistical and threshold decision methods.
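
The nearest-mean decision in the example can be sketched as follows. The sample points, labels, and function names are invented for illustration, and the sketch assumes the equal-variance symmetric clusters described above, so that only the estimated means matter.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))


def nearest_mean(means, phi):
    """Label of the class whose estimated mean is closest (Cartesian distance) to phi."""
    def sq_dist(label):
        return sum((m - x) ** 2 for m, x in zip(means[label], phi))
    return min(means, key=sq_dist)


# Hypothetical training samples of Phi for two classes.
samples = {
    "F1": [(0, 0, 1), (0, 1, 0), (0, 0, 0)],
    "F2": [(1, 1, 1), (1, 0, 1), (1, 1, 0)],
}
means = {label: mean(vs) for label, vs in samples.items()}

print(nearest_mean(means, (0, 0, 0)))
print(nearest_mean(means, (1, 1, 1)))
```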


The trouble with the parametric models, and their relatives, is that even
those that are the most sophisticated contain so little
structure that they are usually thoroughly unsuitable for representing
detailed knowledge about anything. (For example, they cannot satisfactorily
represent finite-state processes.)

Nevertheless, we shall proceed to study the simplest such models,
in which the $\varphi$'s are statistically independent. Our conviction is that unless
this simple "linear" case is thoroughly understood, one can have little
chance of making good "intuitive" judgements about more complicated systems.

1.4 Independence

We can evade some of the problems mentioned in §1.3 if we assume
that the tests $\varphi_i(X)$ are statistically independent for each $F_j$. Mathematically
this means that for any sequence $\Phi(X) = (\varphi_1(X), \ldots, \varphi_m(X))$ of values of $\varphi$ we
can assert that

$$P(\varphi_1(X) \wedge \cdots \wedge \varphi_m(X)/F_j) = P(\varphi_1(X)/F_j) \cdots P(\varphi_m(X)/F_j)$$

for each $j$. More compactly we say

$$P(\Phi/F_j) = \prod_i P(\varphi_i/F_j). \qquad (B)$$

Informally, given that $F_j$ has occurred, $P(\varphi_i/F_j)$ gives the distribution
of each of the $\varphi$'s. If one is further told the values of some of the $\varphi$'s,
independence means that this gives absolutely no further information about the
values of the remaining $\varphi$'s. We want to emphasize that this is a most
stringent condition.

Experimentally one might get independence when the variations in the
responses of the $\varphi$'s are due to "noise"
or measurement uncertainties within the $\varphi$-mechanisms. One would not expect
the values of some $\varphi$'s to help predict those of others.
But where the variations in $\Phi$, for given $F_j$, are due to selection of different $X$'s
from the same F-class, one would not ordinarily assume independence, since the
value of one of the $\varphi$'s tells something about which $X$ in $F$ has occurred, and
hence could help at least partly to predict how another $\varphi$
will behave.



An extreme example of non-independence is the following, in which there
are two functions, $\varphi_1$ and $\varphi_2$, and two classes, $F_1$ and $F_2$.

$\varphi_1(X)$ = a pure random variable with $P(\varphi_1(X) = 1) = \tfrac{1}{2}$.
Its value is determined by tossing a coin, not by $X$.

$$\varphi_2(X) = \begin{cases} \varphi_1(X) & \text{if } X \in F_1 \\ 1 - \varphi_1(X) & \text{if } X \in F_2 \end{cases}$$

Then

$$P(\varphi_1 = 1/F_1) = P(\varphi_1 = 1/F_2) = P(\varphi_2 = 1/F_1) = P(\varphi_2 = 1/F_2) = \tfrac{1}{2}.$$

Notice that neither $\varphi_1$ nor $\varphi_2$ taken alone gives any information whatever about
$F$! Each appears to be a random coin toss. But from both one can determine
perfectly which $F$ has occurred, for

$$\varphi_1 = \varphi_2 \implies F_1, \qquad \varphi_1 \ne \varphi_2 \implies F_2$$

with absolute certainty.

Remark: We will assume only independence within each class $F_j$.
If $X$ ranges over several $F$'s then knowing one $\varphi$-value can help
guess another. For example, suppose that

$$\varphi_1 = \varphi_2 = 0 \text{ if } X \in F_1$$

$$\varphi_1 = \varphi_2 = 1 \text{ if } X \in F_2.$$

The two $\varphi$'s can be (in fact are) independent on each $F$. But if we
did not know that $X \in F_1$ and we were then told that $\varphi_1 = 0$, we could
indeed then predict that $\varphi_2 = 0$ also, without this violating our
independence assumption. If we had been told that $X \in F_1$ we could have
already predicted the value of $\varphi_2$, and in that case learning the
value of $\varphi_1$ would not have helped!
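
The coin-toss example of non-independence given above can be checked by exact enumeration. The sketch below is an illustration only; the function names are invented, and exact fractions are used so that no rounding enters.

```python
from fractions import Fraction


def outcomes(cls):
    """Yield (probability, phi1, phi2) for an event in class 1 or class 2.

    phi1 is a fair coin; phi2 copies phi1 in class 1 and negates it in class 2.
    """
    for coin in (0, 1):
        phi1 = coin
        phi2 = phi1 if cls == 1 else 1 - phi1
        yield Fraction(1, 2), phi1, phi2


def prob(cls, event):
    """Probability, within the given class, that event(phi1, phi2) holds."""
    return sum(p for p, a, b in outcomes(cls) if event(a, b))


def classify(phi1, phi2):
    """Both tests together settle the class with certainty."""
    return 1 if phi1 == phi2 else 2


# Each test alone looks like a fair coin in either class ...
print(prob(1, lambda a, b: a == 1), prob(2, lambda a, b: b == 1))
# ... but the joint probability is not the product of the marginals.
print(prob(1, lambda a, b: a == 1 and b == 1))
```

Within class 1 the joint probability of both tests giving 1 is 1/2, not the 1/4 that independence would require.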


2.0 The linear maximum likelihood estimator


We will assume that the $\varphi$'s are independent (for each $F_j$). We have just
observed a case in which $\Phi = (\varphi_1, \ldots, \varphi_m)$ and we want to know which $F_j$ has
the greatest probability. Which is largest of

$$P(F_1/\Phi), \ldots, P(F_n/\Phi)?$$

Now, using formula (A) of §1.2, this is equivalent to choosing the $j$ for which
$P(\Phi/F_j) \cdot P(F_j)$ is largest. For brevity, define

$$p_{ij} = P(\varphi_i = 1/F_j)$$

$$p_j = P(F_j).$$

We will discuss only Boolean $\varphi$'s; that is, each $\varphi$ has value 0 or 1.
Then, using (B) of §1.4, we can define

$$p_j \prod_{\varphi_i = 1} p_{ij} \prod_{\varphi_i = 0} (1 - p_{ij})$$

as the quantity to be maximized. It is formally convenient to multiply and
divide by $(1 - p_{ij})$ to obtain

$$p_j \prod_{\varphi_i = 1} \frac{p_{ij}}{1 - p_{ij}} \prod_{\text{all } i} (1 - p_{ij})$$

for then we get a form

$$B_j \prod_{\varphi_i = 1} \frac{p_{ij}}{1 - p_{ij}} \qquad (C)$$

in which $B_j = p_j \prod_{\text{all } i} (1 - p_{ij})$ does not depend on $\Phi$, so that
the influence of the actual experiment is concentrated in the product
expression. The terms

$$\frac{p_{ij}}{1 - p_{ij}}$$

are the "odds" or "likelihood ratios" of getting $\varphi_i = 1$ if $X \in F_j$.

It is formally convenient now to replace (C) by its logarithm
because sums are more familiar than products. This changes only
the form, not the content, since $\log(x)$ increases when $x$ does; we still select
the maximum of

$$\log B_j + \sum_{\varphi_i = 1} \log \frac{p_{ij}}{1 - p_{ij}}. \qquad (D)$$

The quantity

$$w_{ij} = \log \frac{p_{ij}}{1 - p_{ij}}$$

is aptly called the "weight of the evidence" of $\varphi_i$ in favor of $F_j$. Now we
can write

$$c_j + \sum_{\varphi_i = 1} w_{ij} = c_j + \sum_{\text{all } i} w_{ij}\,\varphi_i$$

where all that does not depend on the outcome of the actual experiment $\Phi$ is
absorbed into the constant $c_j = \log B_j$, which is therefore a sort of "a priori" weight
for $F_j$ (as opposed to the $\sum_i w_{ij}\varphi_i$ term, which is "a posteriori").

The important conclusion is that the decision can be made upon the basis
of a linear combination of the $\varphi$ terms. This is due directly to the assumption
we made about the independence of the $\varphi$'s for each $F_j$.
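
The passage from the product form to the linear form can be checked numerically. In the sketch below (an illustration with made-up probabilities and hypothetical function names) the linear score is compared with the logarithm of the direct product of probabilities for every Boolean outcome; it assumes every probability lies strictly between 0 and 1, so that no logarithm or division fails.

```python
import math


def weights(p_class, prior):
    """Constant c_j and weights w_ij for one class, from p_ij and P(F_j)."""
    c = math.log(prior) + sum(math.log(1 - p) for p in p_class)
    w = [math.log(p / (1 - p)) for p in p_class]
    return c, w


def score(c, w, phi):
    """Linear form: c_j + sum_i w_ij * phi_i."""
    return c + sum(wi * x for wi, x in zip(w, phi))


def direct(p_class, prior, phi):
    """Product form: P(F_j) * prod of p_ij or (1 - p_ij) according to phi_i."""
    out = prior
    for p, x in zip(p_class, phi):
        out *= p if x else 1 - p
    return out


p = [0.9, 0.2, 0.6]          # hypothetical p_ij for a single class j
c, w = weights(p, prior=0.3)
all_phis = [(a, b, d) for a in (0, 1) for b in (0, 1) for d in (0, 1)]
for phi in all_phis:
    print(phi, round(score(c, w, phi), 6), round(math.log(direct(p, 0.3, phi)), 6))
```

The two printed columns agree, which is all that the algebra in this section claims.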

There are slight asymmetries in our formula that come from
treating $\varphi_i = 1$ as an occurrence of an event and $\varphi_i = 0$ as a
non-occurrence of the same event. This is quite arbitrary, and makes
it hard to return to the more general situation in which the
experiments $\{\varphi_i\}$ could each have many values. Besides, our
algebra introduces a quite unnecessary risk of dividing by
zero! But on the whole, we gain a heuristically clearer
picture and, with this insight, it would be easy enough to
return to the original formulae for repairs.

* "Weight of the evidence" is the term used by I. J. Good.

Formally, one can go slightly further in simplifying the
formulae. Create a new "experiment" $\varphi_0$ whose value is always
1. Define $w_{0j}$ by $w_{0j} = c_j$. Then if we redefine the vector $\Phi$ to be
$(\varphi_0, \varphi_1, \ldots, \varphi_m)$ and define vectors $W_j = (w_{0j}, w_{1j}, \ldots, w_{mj})$ our
procedure is to choose that $j$ for which

$$W_j \cdot \Phi$$

is maximal. Later we will give a more-or-less meaningful
geometric interpretation to this vector product formalism,
but right here it is not very illuminating.

Formula (D) of §2.0 suggests the design of a machine for making our
decisions

[Figure: decision machine; not reproduced.]

where D is a device that simply decides which of its inputs is the largest.
Each $\varphi$-device emits a standard-sized pulse (if $\varphi(X) = 1$) when $X$ is presented.
The pulses are multiplied by the $w_{ij}$ quantities as indicated, and summed at the
$\Sigma$ boxes.


Returning for the moment to costs, if we combine the observations of
§1.1 and §2.0 we will want to minimize (for $k$)

$$\sum_j C_{jk}\, B_j \prod_{\varphi_i = 1} v_{ij}$$

where $B_j = p_j \prod_{\text{all } i} (1 - p_{ij})$ and $v_{ij} = p_{ij}/(1 - p_{ij})$. It is interesting that
this more complicated procedure
also lends itself to the [...]

2.1 The two-class case

If there are only two classes $F_1$ and $F_2$ the decision formula becomes
very simple; we choose $F_1$ if

$$c_1 + \sum_i w_{i1}\varphi_i > c_2 + \sum_i w_{i2}\varphi_i,$$

i.e., when

$$\sum_i (w_{i1} - w_{i2})\,\varphi_i > c_2 - c_1 \qquad (A)$$

which has the form

$$\sum_i a_i \varphi_i > \theta. \qquad (B)$$

A decision function of this sort is called a "linear threshold function"
because the choice depends upon whether the linear function exceeds the
"threshold" $\theta$. It is certainly one of the simplest ways to make a decision that
is not entirely trivial.
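
A two-class linear threshold unit built from assumed probabilities might look like the sketch below. The probabilities, priors, and function names are hypothetical; the coefficients are the differences of the weights of evidence and the threshold is the difference of the two constants, as in the inequality above.

```python
import math


def threshold_unit(p1, p2, prior1, prior2):
    """Coefficients a_i and threshold theta for choosing F1 over F2.

    p1[i], p2[i]: probability that test i gives 1 in class F1, F2
    (assumed to lie strictly between 0 and 1).
    """
    a = [math.log(x / (1 - x)) - math.log(y / (1 - y)) for x, y in zip(p1, p2)]
    c1 = math.log(prior1) + sum(math.log(1 - x) for x in p1)
    c2 = math.log(prior2) + sum(math.log(1 - y) for y in p2)
    return a, c2 - c1


def choose_F1(a, theta, phi):
    """True when the linear function of phi exceeds the threshold."""
    return sum(ai * x for ai, x in zip(a, phi)) > theta


a, theta = threshold_unit([0.9, 0.8], [0.2, 0.3], 0.5, 0.5)
for phi in [(1, 1), (1, 0), (0, 1), (0, 0)]:
    print(phi, "F1" if choose_F1(a, theta, phi) else "F2")
```

Any other choice of `a` and `theta` still gives a linear threshold function; only this particular choice corresponds to the independence assumption.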


Notice that we derived the formula on the basis of some very strong
assumptions--notably the independence of the $\varphi$'s. These will not
be true in general, and indeed it will be exceptional that the
conditions will be close enough to make the formula (A) useful.

The formula (B) is somewhat more general--and therefore somewhat
more widely useful--in that, while it retains the linear threshold
form, it does not require that $a_i$ be precisely

$$a_i = w_{i1} - w_{i2} = \log \frac{p_{i1}\,(1 - p_{i2})}{(1 - p_{i1})\,p_{i2}}$$

and suggests that even if the statistical assumptions of independence
do not hold, there might be other values of $a_i$ that could be used.

This is sometimes true.

3.0 Estimating the $p_{ij}$
3.1 Laplace estimation

The most obvious way to estimate $p_{ij}$ is to present some events in $F_j$ and
record the value of $\varphi_i$. If $N$ events occur and we observe $H$ occurrences of
$\varphi_i = 1$ we can estimate that

$$p_{ij} \approx \frac{H}{N}.$$

Naturally, this estimate will give different values for different samples;
it has a statistical distribution. In fact, it will have a binomial distribution
about the true mean, with variance

$$\sigma^2 = \frac{p(1-p)}{N}.$$

We will not derive this well-known result, but most readers will
recognize that it is plausible because

(i) if $N = 1$ the distribution has two points whose weights
and distances from $p$ give

$$\sigma^2 = (1-p)\,p^2 + p\,(1-p)^2 = p(1-p)$$

(ii) one is used to the uncertainty decreasing with the
square root of sample size,

(iii) the binomial distribution is so simple and fundamental
that one would not expect any other factor to enter.

(end of mathematical joke)

To use this we have to keep a record of $N$. To "update" the estimate
after each observation, denote by $p^{[N]}$ the estimate of $p$ after $N$ trials.

If $\varphi = 1$,

$$p^{[N]} = \frac{(N-1)\,p^{[N-1]} + 1}{N}$$

while if $\varphi = 0$,

$$p^{[N]} = \frac{(N-1)\,p^{[N-1]}}{N}.$$

Both can be summarized by

$$p^{[N]} = \Big(1 - \frac{1}{N}\Big)\,p^{[N-1]} + \frac{1}{N}\,\varphi.$$

This suggests another way to estimate $p$. What would happen if
we replaced the multipliers $(1 - \frac{1}{N})$ and $\frac{1}{N}$ by constants that don't depend on
$N$--the number of trials? It is desirable to make these constants still add
to unity, as $(1 - \frac{1}{N})$ and $\frac{1}{N}$ do, so that the estimated "probabilities" will
also add to unity.
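
The running update with multipliers (1 - 1/N) and 1/N reproduces the plain frequency H/N. The sketch below illustrates this on a short, made-up sequence of observations; the function name and the indexing convention (the estimate after the N-th observation) are assumptions of the example.

```python
def laplace_update(estimate, n, phi):
    """Estimate after the n-th observation phi (n counts from 1)."""
    return (1 - 1 / n) * estimate + phi / n


observations = [1, 0, 1, 1]     # hypothetical values of one test
estimate = 0.0                  # irrelevant: multiplied by zero at n = 1
for n, phi in enumerate(observations, start=1):
    estimate = laplace_update(estimate, n, phi)
    print(n, estimate)
```

The final value equals the fraction of observations in which the test gave 1, and the update needs N but not the individual past observations.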


3.2 Reinforcement estimation

We are estimating the probability that $\varphi = 1$. After each event
we revise our t-th estimate by the formula

$$\alpha^{[t+1]} = \theta\,\alpha^{[t]} + (1 - \theta)\,\varphi^{[t+1]} \qquad (R)$$

where $\theta$ is a constant between 0 and 1. For simplicity we begin by setting

$$\alpha^{[0]} = \varphi^{[0]}.$$

By applying the formula repeatedly we obtain

$$\alpha^{[1]} = \theta\,\alpha^{[0]} + (1 - \theta)\,\varphi^{[1]}$$

$$\alpha^{[2]} = \theta^2 \alpha^{[0]} + (1 - \theta)\,[\varphi^{[2]} + \theta\,\varphi^{[1]}]$$

$$\alpha^{[n]} = \theta^n \alpha^{[0]} + (1 - \theta)\,[\varphi^{[n]} + \theta\,\varphi^{[n-1]} + \ldots + \theta^{n-1} \varphi^{[1]}]$$

which we can write as

$$\alpha^{[n]} = (1 - \theta)\Big[\varphi^{[n]} + \theta\,\varphi^{[n-1]} + \ldots + \theta^{n-1}\varphi^{[1]} + \frac{\theta^n}{1-\theta}\,\varphi^{[0]}\Big]$$

$$= \frac{\varphi^{[n]} + \theta\,\varphi^{[n-1]} + \ldots + \theta^{n-1}\varphi^{[1]} + \frac{\theta^n}{1-\theta}\,\varphi^{[0]}}{1 + \theta + \ldots + \theta^{n-1} + \frac{\theta^n}{1-\theta}}$$

where the denominator is equal to $1/(1 - \theta)$. The last formula is written to
show more clearly that $\alpha^{[n]}$ can be expressed as a simple weighted sum, i.e.,
average of the $\varphi$'s. We have two reasons for writing the sum in this form:

(1) Exponential decay of "memory"

The formula shows that the value of $\alpha$ at time $n$ is
(a) a weighted average of the $\varphi$'s,
(b) the weights, i.e., the influence of older events, fall
off exponentially. For we can write

$$\frac{\partial \alpha^{[n]}}{\partial \varphi^{[n-t]}} = (1 - \theta)\,\theta^t$$

where the derivative can be interpreted as showing how much $\alpha^{[n]}$ depends on the
$\varphi$ that occurred $t$ moments before. The first term $\varphi^{[0]}$ always retains slightly
more weight (by a factor of $\frac{1}{1-\theta}$) but it, too, is subject to the same decay.

(2) The procedure is an estimator for $p$

This follows from the general theorem that if $X_1, \ldots, X_n$ are
independent random variables and $E(X_i)$ is the mean or "expected" value of
each random variable, then

$$E\Big(\sum_i a_i X_i\Big) = \sum_i a_i\,E(X_i),$$

so that, since the weights add to unity and each $\varphi$ has mean $p$, the mean of $\alpha^{[n]}$ is $p$.

The exponential form has advantages in some situations:

(1) We do not have to keep a record of the number of trials, and

(2) The estimator lends itself to "adaptation" for changing situations.

On the other hand, it does not optimally use the experience in non-changing
situations.

We can appraise the efficiency of (R) in using its data by computing its
variance--the mean square error about $p$.
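
The equivalence between the repeated update and an exponentially weighted average can be checked directly. The sketch below is an illustration with an arbitrary sequence of observations and an arbitrary constant; it assumes the estimate is started at the first observation, and the function names are invented.

```python
def reinforce(phis, theta):
    """Apply alpha <- theta * alpha + (1 - theta) * phi, starting from alpha = phis[0]."""
    alpha = phis[0]
    for phi in phis[1:]:
        alpha = theta * alpha + (1 - theta) * phi
    return alpha


def weighted_average(phis, theta):
    """Same quantity written as an explicit weighted sum of the observations."""
    n = len(phis) - 1
    # The observation made t steps before the end has weight (1 - theta) * theta**t ...
    w = [(1 - theta) * theta ** (n - k) for k in range(n + 1)]
    # ... except the very first one, which keeps the extra factor 1 / (1 - theta).
    w[0] = theta ** n
    assert abs(sum(w) - 1.0) < 1e-12     # the weights add to unity
    return sum(wk * x for wk, x in zip(w, phis))


phis = [1, 0, 0, 1, 1, 0, 1]
print(reinforce(phis, 0.8), weighted_average(phis, 0.8))
```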
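3.3 The variance of the reinforcement estimator

Suppose that $X$ has the distribution $f(X)$ with mean $\mu$ and is subject to
the transformation

$$X' = \theta X + (1 - \theta)\,\varphi$$

where prob$(\varphi = 1) = p$. Then the distribution of $X'$ is

$$f'(X) = p\,\alpha(X) + (1 - p)\,\beta(X) \qquad (I)$$

where

$$\alpha(X) = \frac{1}{\theta}\,f\Big(\frac{X - (1-\theta)}{\theta}\Big) \quad \text{and} \quad \beta(X) = \frac{1}{\theta}\,f\Big(\frac{X}{\theta}\Big).$$

Then the variance of $f'(X)$ is (as shown below)

$$\sigma'^2 = p\,\sigma_\alpha^2 + (1-p)\,\sigma_\beta^2 + p(1-p)(\mu_\alpha - \mu_\beta)^2. \qquad (II)$$

If the variance of $f$ is $\sigma^2$ then from (I) we have

$$\sigma_\alpha^2 = \sigma_\beta^2 = \theta^2 \sigma^2, \qquad \mu_\alpha - \mu_\beta = 1 - \theta$$

hence

$$\sigma'^2 = \theta^2 \sigma^2 + p(1-p)(1-\theta)^2. \qquad (III)$$

This is linear in $\sigma^2$ with slope $\theta^2 < 1$, hence iteration must lead to a limit
("fixed") point for the variance. At this limit we must have

$$\sigma^2 = \theta^2 \sigma^2 + p(1-p)(1-\theta)^2, \quad \text{i.e.,} \quad \sigma^2 = \frac{1-\theta}{1+\theta}\,p(1-p).$$

Proof of (II)

$$\sigma'^2 = \int f'(X)(X - \mu')^2 = \int f'(X)\,X^2 - \mu'^2$$

$$= p\int \alpha(X)\,X^2 + (1-p)\int \beta(X)\,X^2 - [p\mu_\alpha + (1-p)\mu_\beta]^2$$

$$= p\Big[\int \alpha(X)\,X^2 - \mu_\alpha^2\Big] + p\mu_\alpha^2 - p^2\mu_\alpha^2$$

$$+ (1-p)\Big[\int \beta(X)\,X^2 - \mu_\beta^2\Big] + (1-p)\mu_\beta^2 - (1-p)^2\mu_\beta^2$$

$$- 2p(1-p)\,\mu_\alpha \mu_\beta$$

$$= p\,\sigma_\alpha^2 + (1-p)\,\sigma_\beta^2 + p(1-p)(\mu_\alpha - \mu_\beta)^2.$$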
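
The claim that iterating the variance recursion leads to a fixed point can be illustrated numerically. The sketch below assumes the recursion in the form variance' = theta^2 * variance + p(1-p)(1-theta)^2 and compares the iterated value with the closed form (1-theta)/(1+theta) * p(1-p); the function names and the particular values of theta and p are arbitrary choices for the example.

```python
def variance_step(var, theta, p):
    """One application of the reinforcement transformation to the variance."""
    return theta ** 2 * var + p * (1 - p) * (1 - theta) ** 2


def limiting_variance(theta, p):
    """Fixed point of variance_step."""
    return (1 - theta) / (1 + theta) * p * (1 - p)


def iterate_variance(theta, p, steps=500, var=0.0):
    """Apply variance_step repeatedly, starting from the given variance."""
    for _ in range(steps):
        var = variance_step(var, theta, p)
    return var


theta, p = 0.9, 0.3
print(iterate_variance(theta, p), limiting_variance(theta, p))
```

Starting from a different initial variance changes only how quickly the limit is approached, since the slope theta^2 is less than 1.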
3.4 The role of $m = 1/(1-\theta)$ as a memory time-constant

When the reinforcement process has been in operation for a very long
time, the variance of its estimate does not approach zero (as it does
for the Laplace estimate when $N$ increases beyond bound). This is because it
depends more upon the recent past than upon the remote past; indeed the very most
recent term can alter the current value by as much as $1 - \theta$, a constant. (In the
uniform weight process the effect is less than $\frac{1}{N}$, which becomes arbitrarily small.)
If we "equate" the variances

$$\frac{p(1-p)}{N} = \frac{1-\theta}{1+\theta}\,p(1-p)$$

we get

$$N = \frac{1+\theta}{1-\theta}.$$

For small values of $\theta$ this is a little more than unity, showing (correctly)
that the estimate is based almost entirely upon the last event. In fact, in
this case

$$\theta\alpha + (1-\theta)\varphi \approx \varphi.$$

For large $\theta$, i.e., close to unity, we have

$$N \approx \frac{2}{1-\theta}$$

so that if $\theta = 1 - \frac{1}{m}$ then the variance is about that one would obtain by
simple averaging of the last $2m$ samples. Thus one can think of the quantity
$m$ as corresponding roughly to a time constant for "forgetting."
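
A few values of the equivalent number of trials, N = (1 + theta)/(1 - theta), make the rule of thumb concrete. The sketch below is only a tabulation; the function name and the chosen values of m are assumptions of the example.

```python
def equivalent_trials(theta):
    """Number of uniformly weighted trials giving the same variance."""
    return (1 + theta) / (1 - theta)


for m in (2, 10, 50, 200):
    theta = 1 - 1 / m
    # Compare with the approximation 2m that holds for theta close to unity.
    print(m, round(equivalent_trials(theta), 3), 2 * m)
```

With theta = 1 - 1/m the exact value is 2m - 1, which is why the approximation 2m is already good for moderate m.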
3.5 The Samuel compromise

In his classical paper about "Some Studies in Machine Learning Using
the Game of Checkers," Arthur L. Samuel uses an ingenious combination of
probability estimation methods. In his application it occasionally happens
that a new evidence term $\varphi_i$ is introduced (and an old one is abandoned because
it has not been of much value in the decision process). When this happens there
is a problem of preventing violent fluctuations, because after one or a few
trials the term's probability estimate will have a large variance as compared
with older terms that have better statistical records. Samuel uses the
following algorithm to "stabilize" his system: he sets $\alpha^{[0]} = \frac{1}{2}$ and

$$\alpha^{[t+1]} = \Big(1 - \frac{1}{N}\Big)\,\alpha^{[t]} + \frac{1}{N}\,\varphi^{[t]}$$

where

$$N = 16 \quad \text{if } t < 32$$

$$N = 2^n \quad \text{if } 2^n \le t < 2^{n+1}$$

$$N = 256 \quad \text{if } 256 \le t.$$

Thus, in the beginning the estimate is made as though the probability had
already been estimated to be $\frac{1}{2}$ on the basis of several, i.e., the order of 16,
trials. Then, in the "middle" period, the algorithm approximates the uniform
weighting procedure. Finally (when $t \ge 256$) the procedure changes to the
exponential decay mode, with fixed $\theta$, so that recent experience can outweigh
earlier results. The use of powers of two represents a convenient computer-
program technique for doing this.
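
The three regimes of the schedule for N can be sketched as below. This is a reading of the description in this section, not a transcription of Samuel's program: the function names are invented, and the form of the update (multipliers 1 - 1/N and 1/N) is an assumption, since the displayed formula is not legible in the source.

```python
def samuel_n(t):
    """Divisor N used at trial t: 16 early, a power of two in the middle, 256 late."""
    if t < 32:
        return 16
    if t >= 256:
        return 256
    return 1 << (t.bit_length() - 1)      # largest power of two not exceeding t


def samuel_update(alpha, t, phi):
    """Assumed update: alpha <- (1 - 1/N) * alpha + (1/N) * phi."""
    n = samuel_n(t)
    return (1 - 1 / n) * alpha + phi / n


for t in (0, 31, 32, 63, 64, 255, 256, 1000):
    print(t, samuel_n(t))

print(samuel_update(0.5, 0, 1))   # one early observation moves the estimate only slightly
```

Because N is always a power of two, the division can be done by shifting, which is presumably the programming convenience the text refers to.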
In Samuel's system, the terms actually used have the form

$$2\alpha - 1$$

so that the "estimator" ranges in the interval from $-1$ to $+1$ and can
be treated as a "correlation coefficient." I mention this here only
to justify Samuel's initial setting of this quantity to 0. In
his context, this setting makes perfect sense, whereas in our
interpretation the setting of $\alpha^{[0]}$ to $\frac{1}{2}$ would be arbitrary.

3.6 Variants of the reinforcement estimator

Consider the "reinforcement process"

$$\alpha' = \theta\alpha + (1 - \theta)\,\varphi. \qquad (R_1)$$

If the distribution of $\alpha$ has mean $\mu$ then the distribution of $\alpha'$ will have mean

$$\mu' = (1-p)\,\theta\mu + p\,(\theta\mu + (1-\theta))$$

because there is probability $(1-p)$ that $\varphi = 0$ and probability $p$ that $\varphi = 1$. Then

$$\mu' = \theta\mu + (1-\theta)\,p.$$

Applying this again we get for the mean of $(\alpha')'$

$$\mu'' = \theta(\theta\mu + (1-\theta)p) + (1-\theta)p$$

$$= \theta^2\mu + (1-\theta)\,p\,(1+\theta)$$

$$= \theta^2\mu + (1-\theta^2)\,p$$

and similarly, if we apply $R_1$ $n$ times, we get

$$\mu^{(n)} = \theta^n\mu + (1-\theta^n)\,p.$$

Clearly as $n \to \infty$, $\mu^{(n)} \to p$.

This analysis can be replaced by a more general method, by using the
following two simple observations:

Lemma 1: If $\alpha' = f(\alpha, \varphi)$ is linear in $\varphi$ then, if $\mu(\alpha) = m$ and $P(\varphi = 1) = p$,

$$\mu(\alpha') = f(m, p).$$

Proof:

$$\mu(\alpha') = (1-p)\,f(m, 0) + p\,f(m, 1)$$

$$= f(m, 0) + p\,[f(m, 1) - f(m, 0)]$$

$$= f(m, p).$$

(The first step also uses the linearity of $f$ in $\alpha$, which holds for all the
processes considered here.)

Lemma 2: If $|\partial f/\partial \alpha| < 1$ then the limit of $\alpha, f(\alpha), f(f(\alpha)), \ldots = f^\infty(\alpha)$ exists
and is the (unique) solution of the equation

$$y = f(y, p).$$

Proof: This is a "fixed point theorem" and the diagrams below show why it
is true:

[Diagrams not reproduced.]

Now we can apply these to the formula $R_1$, and have only to solve

$$y = \theta y + (1 - \theta)\,p$$

or

$$y = p.$$

But with the lemmas, we can also analyze some other systems. Another
interesting one is

$$\alpha' = \theta(\alpha + \varphi) \qquad (R_2)$$

and we have

$$y = \theta(y + p)$$

or

$$y = \frac{\theta}{1-\theta}\,p$$

so that this, too, is an estimator for $p$ which has to be corrected
by the constant factor $\frac{\theta}{1-\theta}$.

Another, somewhat different reinforcer is

$$\alpha' = \begin{cases} \alpha + 1 & \text{if } \varphi = 1 \\ \theta\alpha & \text{if } \varphi = 0 \end{cases} \qquad (R_3)$$

which can be written as

$$\alpha' = \varphi + \alpha(\varphi + \theta - \varphi\theta).$$

This satisfies the conditions of the Lemmas, since

$$\Big|\frac{\partial f}{\partial \alpha}\Big| = |p + \theta - p\theta| = |1 - (1-p)(1-\theta)| < 1$$

so we have

$$y = p + y\,(p + \theta - p\theta)$$

$$y = \frac{1}{1-\theta} \cdot \frac{p}{1-p}.$$

This is an estimator (with the correction factor) of the likelihood
ratio $\frac{p}{1-p}$. It is interesting that this is so easily obtained by a reinforcement
process as simple as: "if $\varphi = 1$ occurs, add 1 to $\alpha$; if $\varphi = 0$ occurs, multiply
$\alpha$ by $\theta$."

Another simple form is

$$\alpha' = \theta\alpha + \varphi \qquad (R_4)$$

which leads to the estimate

$$y = \frac{1}{1-\theta}\,p.$$

Finally, one might consider the very simple form

$$\alpha' = \alpha + \varphi. \qquad (R_5)$$

This "diverges," i.e., the $\alpha$'s grow beyond bound (and do not satisfy the
condition for Lemma 2). Still, if one is making decisions by comparing
different $p_{ij}$'s, one can use the ratios of these simple "scores" as likelihood
ratios, to obtain the "uniform weighting" type of behavior. We include $R_5$
only to indicate its heuristic similarity to the others.
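
The fixed points claimed for the mean under the first four processes can be checked by iterating the mean recursion itself, in the spirit of the two lemmas. The sketch below is an illustration only: the values of theta and p are arbitrary, the names are invented, and the process labels in the comments follow the order in which the processes are introduced in this section.

```python
def mean_fixed_point(f, p, start=0.0, steps=2000):
    """Iterate m <- f(m, p), the recursion for the mean, and return the limit."""
    y = start
    for _ in range(steps):
        y = f(y, p)
    return y


theta, p = 0.9, 0.3

processes = {
    "R1": lambda a, phi: theta * a + (1 - theta) * phi,           # estimates p
    "R2": lambda a, phi: theta * (a + phi),                       # theta/(1-theta) * p
    "R3": lambda a, phi: phi + a * (phi + theta - phi * theta),   # p/(1-p) / (1-theta)
    "R4": lambda a, phi: theta * a + phi,                         # p / (1-theta)
}

predicted = {
    "R1": p,
    "R2": theta / (1 - theta) * p,
    "R3": p / ((1 - p) * (1 - theta)),
    "R4": p / (1 - theta),
}

for name, f in processes.items():
    print(name, mean_fixed_point(f, p), predicted[name])
```

The last process, which simply adds each observation to the score, has slope 1 in the score and so has no fixed point; it is omitted from the table for that reason.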

3.7 A simple "synaptic" reinforcer theory

Let us make a simple "neuronal model." The model is to "account" for the
following phenomena:

1. There is to be a quantity $\alpha_{ij}$ that estimates $p_{ij} = P(\varphi_i/F_j)$;

2. The only information available are the occurrences of $\varphi_i = 1$ and $X \in F_j$.

[Figure: a bag $B_i$ enclosed within a bag $C_j$; not reproduced.]

The bag $B_i$ contains a very high and constant concentration of a substance A.
When $\varphi_i$ or $F_j$ occur--or "fire"--the walls of the corresponding bags $B_i$ and/or
$C_j$ become "permeable" to A for a moment. If $\varphi_i$ alone occurs, nothing really
changes, because $B_i$ is surrounded by the impermeable $C_j$. If $F_j$ alone fires,
it loses some A by diffusion to the outside; in fact, if $\alpha$ is the amount of A
in $C_j$ it may be assumed (by the usual laws of diffusion and concentration) to
lose some fraction $(1 - \theta)$ of $\alpha$:

$$\alpha' = \theta\alpha.$$

If both $\varphi_i$ and $F_j$ are active then approximately the same loss will occur from
$C_j$. But an essentially constant amount $b$ will be "injected" from $B_i$ to $C_j$. So

$$\alpha' = \theta\alpha + b.$$

We can assume that $b$ is constant because the concentration of A is very high in
$B_i$ compared to that in $C_j$. Or one can invent any number of similar variations.

In any case we get

$$\alpha' = \theta\alpha + \varphi b$$

so that in the limit the mean of $\alpha$ will approach

$$\frac{b}{1-\theta}\,p$$

which is proportional to, and hence an estimator of, $p_{ij} = \text{prob}(\varphi_i/F_j)$.

Thus the simple geometry together with the idea of a membrane becoming
permeable briefly following a nerve impulse gives us a quantity that
is an estimator of the appropriate probability.
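
A small simulation can illustrate how the stored amount settles near b p/(1 - theta). The sketch below is an assumption-laden toy, not a physiological model: every step is taken to be a firing of the class, the test fires with probability p on each such step, and the parameter values, seed, and function name are arbitrary.

```python
import random


def simulate(theta, b, p, steps, seed=0):
    """Time-average of alpha under alpha <- theta * alpha + phi * b.

    phi is 1 with probability p on each step; the first half of the run
    is discarded so that the starting value does not matter.
    """
    rng = random.Random(seed)
    alpha, total, counted = 0.0, 0.0, 0
    for t in range(steps):
        phi = 1 if rng.random() < p else 0
        alpha = theta * alpha + phi * b
        if t >= steps // 2:
            total += alpha
            counted += 1
    return total / counted


average = simulate(theta=0.95, b=2.0, p=0.4, steps=200_000)
predicted = 2.0 * 0.4 / (1 - 0.95)
print(average, predicted)
```

Dividing the observed average by b/(1 - theta) recovers an estimate of p, which is the sense in which the stored amount is proportional to the probability.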

How could this representation of probability be translated into a useful
neuronal mechanism? One could imagine all sorts of schemes; ionic
concentrations--or rather, their logarithms!--could become membrane potentials,
or conductivities, or even probabilities of occurrences of other chemical
events. The "anatomy" and "physiology" of our model could easily be modified
to obtain $R_3$ processes and their attendant "likelihood ratios." Indeed, it is
so easy to imagine variants--the idea is so insensitive to details--that I don't
propose it to be considered seriously, except as a family of simple yet
intriguing models that a neural theorist should have available.



